Efficient representation as a design principle for neural coding and computation 
Does the brain construct an efficient representation of the sensory world? We review progress 
on this question, focusing on a series of experiments in the last decade which use fly vision as 
a model system in which theory and experiment can confront each other. Although the idea of 
efficient representation has been productive, clearly it is incomplete since it doesn't tell us which 
bits of sensory information are most valuable to the organism. We suggest that an organism which 
maximizes the (biologically meaningful) adaptive value of its actions given fixed resources should 
have internal representations of the outside world that are optimal in a very specific information 
theoretic sense: they maximize the information about the future of sensory inputs at a fixed value 
of the information about their past. This principle contains as special cases computations which 
the brain seems to carry out, and it should be possible to test this optimization directly. We return 
to the fly visual system and report the results of preliminary experiments that are in encouraging 
agreement with theory. 



I. INTRODUCTION 

Since Shannon's original work [1] there has been the 
hope that information theory would provide not only 
a guide to the design of engineered communication sys- 
tems but also a framework for understanding information 
processing in biological systems. One of the most con- 
crete implementations of this idea is the proposal that 
computations in the brain serve to construct an efficient 
(perhaps even maximally efficient) representation of 
incoming sensory data [2, 3, 4]. Since efficient coding 
schemes are matched, at least implicitly, to the distri- 
bution of input signals, this means that what the brain 
computes — perhaps down to the properties of individ- 
ual neurons — should be predictable from the statistical 
structure of the sensory world. This is a very attractive 
picture, and points toward general theoretical principles 
rather than just a set of small models for different small 
pieces of the brain. More precisely, this picture suggests 
a research program that could lead to an experimentally 
testable theory. 

Our research efforts, over several years, have been in- 
fluenced by these ideas of efficient representation. On 
the one hand, we have found evidence for this sort of 
optimization in the responses of single neurons in the fly 
visual system, especially once we developed tools for ex- 
ploring the responses to more naturalistic sensory inputs. 
On the other hand, we have been concerned that simple 
implementations of information theoretic optimization 
principles must be wrong, because they implicitly at- 
tach equal value to all possible bits of information about 
the world. In response to these concerns, we have been 
trying to develop alternative approaches, still grounded 
in information theory but not completely agnostic about 
the value of information. Guided by our earlier results, 
we also want to phrase these theoretical ideas in a way 
that suggests new experiments. 

What we have outlined here is an ambitious program, 
and certainly we have not reached anything like com- 
pletion. The invitation to speak at the International 
Symposium on Information Theory in 2006 seemed like 
a good occasion for a progress report, so that is what 
we present here. It is much easier to convey the sense 
of 'work in progress' when speaking than when writing, 
and we hope that the necessary formalities of text do not 
obscure the fact that we are still groping for the correct 
formulation of our ideas. We also hope that, incomplete 
as it is, others will find the current state of our under- 
standing useful and perhaps even provocative. 



II. SOME RESULTS FROM THE FLY VISUAL 
SYSTEM 

The idea of efficient representation in the brain has 
motivated a considerable amount of work over several 
decades. We begin by reviewing some of what has been 
done along these lines, focusing on one experimental test- 
ing ground, the motion sensitive neurons in the fly visual 
system. 

Many animals, in particular those that fly, rely on vi- 
sual motion estimation to navigate through the world. 
The sensory-motor system responsible for this task, 
loosely referred to as the optomotor control loop, has 
been the subject of intense investigation in the fly, both 
in behavioral [5] and in electrophysiological studies. In 
particular, Bishop and Keehn [6] described wide field 
motion sensitive cells in the fly's lobula plate, and some 
neurons of this class have been directly implicated in 
optomotor control [7]. The fly's motion sensitive 
visual neurons thus are critical for behavior, and one can 
record the action potentials or spikes generated by 
individual motion sensitive cells (e.g., the cell H1, a lobula 
plate neuron selective for horizontal inward motion) us- 
ing an extracellular tungsten microelectrode, and stan- 
dard electrophysiological methods [5]; unlike most such 
recordings, in the fly one can record stably and continu- 
ously for days. 

The extreme stability of the H1 recordings has made 
this system an attractive testing ground for a wide 
variety of issues in neural coding and computation. In 
particular, for H1 it has been possible to show that: 

1. Sequences of action potentials provide large 
amounts of information about visual inputs, within 
a factor of two of the limit set by the entropy of 
these sequences even when we distinguish spike 
arrival times with millisecond resolution [9]. 

2. This efficiency of coding has significant contribu- 
tions from temporal patterns of spikes that provide 
more information than expected by adding up the 
information carried by individual spikes [10]. 

3. Although many aspects of the neural response vary 
among individual flies, the efficiency of coding is 
nearly constant [11]. 

4. Information rates and coding efficiencies are 
higher, and the high efficiency extends to even 
higher time resolution, when we deliver stimulus 
ensembles that more closely approximate the 
stimuli which flies encounter in nature [12, 13]. 

5. The apparent input/output relation of these 
neurons changes in response to changes in the 
input distribution. For the simple case where we 
change the dynamic range of velocity signals, the 
input/output relation rescales so that the signal 
is encoded in relative units; the magnitude of 
the rescaling factor maximizes information transfer 
[14, 15, 16]. 

6. In order to adjust the input/output relation reli- 
ably, the system has to collect enough samples to 
be sure that the input distribution has changed. In 
fact the speed of adaptation is close to this 
theoretical limit [17]. 

All of these results point toward the utility of efficient 
representation as a hypothesis guiding the design of new 
experiments, and perhaps even as a real theory of the 
neural code. So much for the good news. 

III. ON THE OTHER HAND ... 

Despite the successes of information theoretic ap- 
proaches to the neural code in fly vision and in other 
systems, we must be honest and consider the funda- 
mental stumbling blocks in any effort to use informa- 
tion theoretic ideas in the analysis of biological systems. 



First, Shannon's formulation of information theory has 
no place for the value or meaning of the information. 
This is not an accident. On the first page of his 1948 
paper [1], Shannon remarked (italics in the original): 

Frequently the messages have meaning; that 
is they refer to or are correlated according 
to some system with certain physical or con- 
ceptual entities. These semantic aspects of 
the communication are irrelevant to the en- 
gineering problem. 

Yet surely organisms find some bits more valuable than 
others, and any theory that renders meaning irrelevant 
must miss something fundamental about how organisms 
work. Second, it is difficult to imagine that evolution can 
select for abstract quantities such as the number of bits 
that the brain extracts from its sensory inputs. Both of 
these problems point away from general mathematical 
structures toward biological details such as the fitness or 
adaptive value of particular actions, the costs of particu- 
lar errors, and the resources needed to carry out specific 
computations. It would be attractive to have a theoret- 
ical framework that is faithful to these biological details 
but nonetheless derives predictions from more general 
principles. 

To develop a biologically meaningful notion of opti- 
mization, we should start with the idea that there is 
some metric ('adaptive value' in Fig. 1) for the quality 
or utility of the actions taken by an organism, and that 
there are resources that the organism needs to spend in 
order to take these actions and to maintain the appara- 
tus that collects and processes the relevant sensory in- 
formation. The ultimate metric is evolutionary fitness, 
but in more limited contexts one can think about the 
frequency or value of rewards and punishments, and in 
experiments one can manipulate these metrics directly. 
Costs often are measured in metabolic terms, but one 
also can measure the volume of neural circuitry devoted 
to a task. Presumably there also are costs associated 
with the development of complex structures, although 
these are difficult to quantify. 

FIG. 1: Optimization from a biological point of view. 

Given precise definitions of utility and cost for dif- 
ferent strategies (whether represented by neurons or by 
genomes), the biologically meaningful optimum is to 
maximize utility at fixed cost: While there may be no 
global answer to the question of how much an organism 
should spend, there is a notion that it should receive 
the maximum return on this investment. Thus within a 
given setting there is a curve that describes the maxi- 
mum possible utility as a function of the cost, and this 
curve divides the utility/cost plane into regions that are 
possible and impossible for organisms to achieve, as in 
Fig. 1. This curve defines a notion of optimal 
performance that seems well grounded in the facts of life, even 
if we can't compute it. The question is whether we can 
map this biological notion of optimization into some- 
thing that has the generality and power of information 
theory. 



IV. COSTS AND BENEFITS ARE RELATED 
TO INFORMATION 

To begin, we note that taking actions which achieve a 
criterion level of fitness requires a minimum number of 
bits of information, as schematized in the upper left of 
Fig. 2. Consider an experiment in which human subjects 
point at a target, and the reward or utility is depen- 
dent upon the positional error of the pointing. We can 
think of the motor neurons, muscles and kinematics of 
the arm together as a communication channel that trans- 
forms some central neural representation into mechanical 
displacements. If we had an accurate model of this com- 
munication channel we could calculate its rate-distortion 
function, which determines the minimum number of bits 
required in specifying the command to ensure 
displacements of specified accuracy across a range of possible 
target locations. The rate-distortion function divides the 
utility/information plane into accessible and inaccessible 
regions. 

It also is true that bits are not free. In the classical ex- 
amples of communication channels, the signal-to-noise 
ratio (SNR) with which data can be transmitted is re- 
lated directly to the power dissipation, and the SNR in 
turn sets the maximum number of bits that can be trans- 
mitted in a given amount of time; this is (almost) the 
concept of channel capacity. If we think about the bits 
that will be used to direct an action, then there are many 
costs — the cost of acquiring the information, of repre- 
senting the information, and the more obvious physical 
costs of carrying out the resulting actions. Continuing 
with the example of motor control, we always can assign 
these costs to the symbols at the entrance to the commu- 
nication channel formed by the motor neurons, muscles 
and arm kinematics. The channel capacity separates the 
information/cost plane into accessible and inaccessible 
regions, as in the lower right quadrant of Fig. 2. Ideas 
about metabolically efficient neural codes [18, 19] can be 
seen as efforts to calculate this curve in specific models. 

FIG. 2: Biological costs and benefits are connected to bits. 
The upper right quadrant redraws the biologically motivated 
notion of optimization from Fig. 1, trading resources for 
adaptive value. In the upper left, we show schematically that 
achieving a given quality of performance requires a minimum 
number of bits, in the spirit of rate-distortion theory. In the 
lower right, we show that a given resource expenditure will 
suffice only to collect a certain maximal number of bits, in 
the spirit of channel coding. Through these connections, the 
quantities that govern biological optimization are translated 
into bits. But are these the same bits? 



V. CAN WE CLOSE THE LOOP? 

To complete the link between biological optimization 
and information theoretic ideas, we need to remem- 
ber that there is a causal path from information about 
the outside world to internal representations to actions. 
Thus the adaptive value of actions always depends on the 
state of the world after the internal representation has 
been formed, simply because it takes time to transform 
representations into actions; the only bits that can con- 
tribute to fitness are those which have predictive power 
regarding the future state of the world. In contrast, be- 
cause of causality, any internal representation necessarily 
is built out of information about the past. 

The fact that representations are built from data 
about the past but are useful only to the extent that 
they provide information about the future means that, 
for the organism, the bits in the rate-distortion tradeoff 
are bits about the future, while the bits in the channel 
capacity tradeoff are bits about the past. Thus the 
different tradeoffs we have been discussing (the biologically 
relevant trade between resources and adaptive value, the 
rate-distortion relation between adaptive value and bits, 
and the channel capacity trading of resources for bits) 
form three quadrants in a plane (Fig. 3). The fourth 
quadrant, which completes the picture, is a purely 
information theoretic tradeoff between bits about the past 
and bits about the future. 

FIG. 3: Connecting the different optimization principles. 
Lines indicate curves of optimal performance, separating 
allowed from forbidden (hashed) regions of each quadrant. In 
the upper right quadrant is the biological principle, 
maximizing fitness or adaptive value at fixed resources. But actions 
that achieve a given level of adaptive value require a minimum 
number of bits, and since actions occur after plans these are 
bits about the future (upper left). On the other hand, the 
organism has to "pay" for bits, and hence there is a minimum 
resource cost for any representation of information (lower 
right). Finally, given some bits (necessarily obtained from 
observations on the past), there is some maximum number of 
bits of predictive power (lower left). To find a point on the 
biological optimum one can try to follow a path through the 
other three quadrants, as indicated by the arrows. 

Assuming that the organism lives in a statistically 
stationary world, predictions ultimately are limited by 
the statistical structure of the data that the organism 
collects. More concretely, if we observe a time series 
through a window of duration $T$ (that is, for times 
$-T < t \le 0$), then to represent the data $X_{\rm past}$ we 
have collected requires $S(T)$ bits, where $S$ is the entropy, 
but the information that these data provide about the 
future $X_{\rm future}$ (i.e., at times $t > 0$) is given by some 
$I(X_{\rm past}; X_{\rm future}) \equiv I_{\rm pred}(T) \ll S(T)$. In particular, 
while for large $T$ the entropy $S(T)$ is expected to 
become extensive, the predictive information $I_{\rm pred}(T)$ 
always is subextensive [20]. Thus we expect that the data 
$X_{\rm past}$ can be compressed significantly into some internal 
representation $X_{\rm int}$ without losing too much of the 
relevant information about $X_{\rm future}$. This problem (mapping 
$X_{\rm past} \to X_{\rm int}$ to minimize the information $I(X_{\rm int}; X_{\rm past})$ 
that we keep about the past while maintaining 
information $I(X_{\rm int}; X_{\rm future})$ about the future) is an example of 
the "information bottleneck" problem [21]. Again there 
is a curve of optimal performance, separating the plane 
into allowed and forbidden regions. Formally, we can 
construct this optimum by solving 

$$\max_{X_{\rm past} \to X_{\rm int}} \left[ I(X_{\rm int}; X_{\rm future}) - \lambda I(X_{\rm int}; X_{\rm past}) \right] , \tag{1}$$

where $X_{\rm past} \to X_{\rm int}$ is the rule for creating the internal 
representation and $\lambda$ is a Lagrange multiplier. 
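
As a minimal illustration of the two quantities traded off in Eq. (1), the following Python sketch evaluates $I(X_{\rm int}; X_{\rm past})$ and $I(X_{\rm int}; X_{\rm future})$ for a toy problem. Everything specific here is an assumption made for illustration and is not taken from the experiments discussed in this paper: the data are a symmetric binary Markov chain, the past is the last two symbols, the future is the next symbol, and the helper names (`mutual_information`, `markov_joint`, `bottleneck_terms`) and the value of `lam` are hypothetical. The sketch compares two deterministic encoders rather than solving the optimization itself.

```python
from itertools import product

import numpy as np


def mutual_information(joint):
    """Mutual information (in bits) of a 2-D joint probability table."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (px @ py)[mask])))


def markov_joint(p_stay=0.9):
    """Joint P(x_{-1}, x_0, x_1) for a stationary symmetric binary Markov chain."""
    transition = np.array([[p_stay, 1 - p_stay], [1 - p_stay, p_stay]])
    joint = np.zeros((2, 2, 2))
    for a, b, c in product(range(2), repeat=3):
        joint[a, b, c] = 0.5 * transition[a, b] * transition[b, c]
    return joint


def bottleneck_terms(joint, encoder):
    """Return (I(X_int; X_past), I(X_int; X_future)) for a deterministic encoder.

    `encoder` maps a past (x_{-1}, x_0) to an integer label for X_int.
    """
    pasts = list(product(range(2), repeat=2))
    labels = sorted({encoder(p) for p in pasts})
    p_int_past = np.zeros((len(labels), len(pasts)))
    p_int_future = np.zeros((len(labels), 2))
    for j, (a, b) in enumerate(pasts):
        i = labels.index(encoder((a, b)))
        p_int_past[i, j] += joint[a, b].sum()   # P(X_int, X_past)
        p_int_future[i] += joint[a, b]          # P(X_int, X_future)
    return mutual_information(p_int_past), mutual_information(p_int_future)


def objective(terms, lam):
    """Value of the bracket in Eq. (1) for given (I_past, I_future)."""
    i_past, i_future = terms
    return i_future - lam * i_past


joint = markov_joint()
keep_all = bottleneck_terms(joint, lambda past: 2 * past[0] + past[1])
keep_last = bottleneck_terms(joint, lambda past: past[1])
lam = 0.5  # illustrative value of the Lagrange multiplier
print("keep all :", keep_all, objective(keep_all, lam))
print("keep last:", keep_last, objective(keep_last, lam))
```

For a Markov chain the most recent symbol carries all of the predictive information, so the encoder that discards the older symbol keeps the same bits about the future while spending fewer bits on the past, and therefore scores higher on the objective for any positive `lam`.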

We see that there are several different optimization 
principles, all connected, as schematized in Fig. 3. The 
biologically relevant principle is to maximize the fitness 
$F$ given some resource cost $C$. But in order to take 
actions that achieve some mean fitness $F$ in a potentially 
fluctuating environment, the organism must have an 
internal representation $X_{\rm int}$ that provides some minimum 
amount of information $I(X_{\rm int}; X_{\rm future})$ about the future 
states of that environment; the curve of $I(X_{\rm int}; X_{\rm future})$ vs. 
$F$ is a version of the rate-distortion curve. Building and 
acting upon this internal representation, however, 
entails various costs, and these can all be assigned to the 
construction of the representation out of the (past) data 
as they are collected; the curve of $I(X_{\rm int}; X_{\rm past})$ vs. $C$ 
is an example of the channel capacity. Finally, the 
information bottleneck principle tells us that there is an 
optimum choice of internal representation which 
maximizes $I(X_{\rm int}; X_{\rm future})$ at fixed $I(X_{\rm int}; X_{\rm past})$. 

The four interconnected optimization principles cer- 
tainly have to be consistent with one another. Thus, if 
an organism wants to achieve a certain mean fitness, it 
needs a minimum number of bits of predictive power, and 
this requires collecting a minimum number of bits about 
the past, which in turn necessitates some minimum cost. 
The possible combinations of cost and fitness, that is, the 
accessible region of the biologically meaningful tradeoff in 
Fig. 1, thus have a reflection in the "information plane" 
(the lower left quadrant of Fig. 3), where we trade bits 
about the future against bits about the past. 

The consistency of the different optimization princi- 
ples means that the purely information theoretic tradeoff 
between bits about the future and bits about the past 
must constrain the biologically optimal tradeoff between 
resources and fitness. We would like to make "constrain" 
more precise, and conjecture that under reasonable 
conditions organisms which operate at the biological 
optimum (that is, along the bounding curve in the upper 
right quadrant of Fig. 3) also operate along the 
information theoretic optimum (the bounding curve in the lower 
left quadrant of Fig. 3). At the moment this is only a 
conjecture, but we hope that the relationships in Fig. 3 
open a path to some more rigorous connections between 
information theoretic and biological quantities. 

VI. A UNIFYING PRINCIPLE? 

The optimization principle in Eq. (1) is very abstract; 
here we consider two concrete examples. First imagine 
that we observe a Gaussian stochastic process $\{x(t)\}$ that 
consists of a correlated signal $\{s(t)\}$ in a background of 
white noise $\{\eta(t)\}$. For simplicity, let's understand 
'correlated' to mean that $s(t)$ has an exponentially decaying 
correlation function with a correlation time $\tau_c$. Thus, 
$x(t) = s(t) + \eta(t)$, where 

$$\langle s(t) s(t') \rangle = \sigma^2 \exp(-|t - t'|/\tau_c) \tag{2}$$
$$\langle \eta(t) \eta(t') \rangle = N_0 \, \delta(t - t') , \tag{3}$$

and hence the power spectrum of $x(t)$ is given by 

$$\langle x(t) x(t') \rangle = \int \frac{d\omega}{2\pi} S_x(\omega) \exp[-i\omega(t - t')] \tag{4}$$
$$S_x(\omega) = \frac{2\sigma^2 \tau_c}{1 + (\omega\tau_c)^2} + N_0 . \tag{5}$$



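The full probability distribution for the function $x(t)$ is 

$$P[x(t)] = \frac{1}{Z} \exp\left[ -\frac{1}{2} \int dt \int dt' \, x(t) K(t - t') x(t') \right] , \tag{6}$$

where $Z$ is a normalization constant and the kernel 

$$K(t) = \int \frac{d\omega}{2\pi} \frac{1}{S_x(\omega)} \exp(-i\omega t) . \tag{7}$$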

If we sit at $t = 0$, then $X_{\rm past} \equiv x(t < 0)$ and 
$X_{\rm future} \equiv x(t > 0)$. In the exponential of Eq. (6), mixing between 
$X_{\rm past}$ and $X_{\rm future}$ is confined to a term which can be 
written as 

$$\left[ \int_{-\infty}^{0} dt \, g(-t) x(t) \right] \cdot \left[ \int_{0}^{\infty} dt' \, g(t') x(t') \right] , \tag{8}$$

where $g(t) = \exp(-t/\tau_0)$, with $\tau_0 = \tau_c (1 + \sigma^2 \tau_c / N_0)^{-1/2}$. 
This means that the probability distribution of $X_{\rm future}$ 
given $X_{\rm past}$ depends only on $x(t)$ as seen through the 
linear filter $g(\tau)$, and hence only this filtered version of 
the past can contribute to $X_{\rm int}$ [22]. 

The filter $g(t)$ is exactly the filter that provides 
optimal separation between the signal $s(t)$ and the noise 
$\eta(t)$; more precisely, given the data $X_{\rm past}$, if we ask for 
the best estimate of the signal $s(t)$, where "best" means 
minimizing the mean-square error, then this optimal 
estimate is just $y(t)$ [24]. Solving the problem of optimally 
representing the predictive information in this time series 
thus is identical to the problem of optimally separating 
signal from noise. 
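
The link between filtering the past and estimating the signal can be explored numerically. The sketch below is illustrative only: it simulates a discretized version of the process in Eqs. (2) and (3), passes $x(t)$ through a causal exponential filter, and measures the mean-square error of the resulting estimate of $s(t)$ after fitting a single gain. The parameter values, the time step, and the function names (`simulate`, `causal_exponential_filter`, `mse_after_best_gain`) are assumptions introduced here, and the filter time constant is treated as a free parameter to be scanned rather than taken from the closed-form expression in the text.

```python
import numpy as np


def simulate(n=200_000, dt=1e-3, tau_c=0.05, sigma=1.0, n0=0.01, seed=1):
    """Return (s, x): exponentially correlated signal and signal plus white noise."""
    rng = np.random.default_rng(seed)
    a = np.exp(-dt / tau_c)
    kicks = sigma * np.sqrt(1.0 - a**2) * rng.standard_normal(n)
    s = np.empty(n)
    s[0] = sigma * rng.standard_normal()
    for i in range(1, n):
        s[i] = a * s[i - 1] + kicks[i]
    # White noise with spectral density n0 has variance n0/dt per time bin.
    eta = np.sqrt(n0 / dt) * rng.standard_normal(n)
    return s, s + eta


def causal_exponential_filter(x, tau, dt=1e-3):
    """Discretized y(t) = integral over t' <= t of exp(-(t - t')/tau) x(t')."""
    decay = np.exp(-dt / tau)
    y = np.empty_like(x)
    acc = 0.0
    for i, xi in enumerate(x):
        acc = decay * acc + xi * dt
        y[i] = acc
    return y


def mse_after_best_gain(estimate, target):
    """Mean-square error after scaling `estimate` by its least-squares gain."""
    gain = np.dot(estimate, target) / np.dot(estimate, estimate)
    return float(np.mean((gain * estimate - target) ** 2))


s, x = simulate(n=50_000)
print("unfiltered:", mse_after_best_gain(x, s))
for tau in (0.005, 0.02, 0.08):
    y = causal_exponential_filter(x, tau)
    print(f"tau = {tau:.3f} s:", mse_after_best_gain(y, s))
```

Scanning `tau` on a finer grid gives a rough empirical check of which exponential filter separates signal from noise best for a chosen signal-to-noise ratio.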

In contrast to these results for Gaussian time series 
with finite correlation times, consider what happens when we 
look at a time series that has essentially infinitely long 
correlations. Specifically, consider an ensemble of 
possible experiments in which points $x_n$ are drawn 
independently and at random from the probability distribution 
$P(x|\alpha)$, where $\alpha$ is a $K$-dimensional vector of 
parameters specifying the distribution. At the start of each 
experiment these parameters are drawn from the 
distribution $P(\alpha)$ and then fixed for all $n$. Thus the joint 
distribution for many successive observations on $x$ in 
one experiment is given by 

$$P(x_1, x_2, \cdots, x_M) = \int d^K\alpha \, P(\alpha) \prod_{n=1}^{M} P(x_n|\alpha) . \tag{9}$$

Now we can define $X_{\rm past} \equiv \{x_1, x_2, \cdots, x_N\}$ and 
$X_{\rm future} \equiv \{x_{N+1}, x_{N+2}, \cdots, x_M\}$, and we can 
(optimistically) imagine an unbounded future, $M \to \infty$. To find 
the optimal representation of predictive information in 
this case we need a bit more of the apparatus of the 
information bottleneck [21]. 

It was shown in Ref. [21] that an optimization problem 
of the form in Eq. (1) can be solved by probabilistic 
mappings $X_{\rm past} \to X_{\rm int}$ provided that the distribution which 
describes this mapping obeys a self-consistent equation, 

$$P(X_{\rm int}|X_{\rm past}) = \frac{1}{Z(X_{\rm past}; \lambda)} \exp\left[ -\frac{1}{\lambda} D_{KL}(X_{\rm past}; X_{\rm int}) \right] , \tag{10}$$

where $Z$ is a normalization constant and 
$D_{KL}(X_{\rm past}; X_{\rm int})$ is the Kullback-Leibler divergence 
between the distributions of $X_{\rm future}$ conditional on $X_{\rm past}$ 
and $X_{\rm int}$, respectively, 

$$D_{KL}(X_{\rm past}; X_{\rm int}) = \int DX_{\rm future} \, P(X_{\rm future}|X_{\rm past}) \ln\left[ \frac{P(X_{\rm future}|X_{\rm past})}{P(X_{\rm future}|X_{\rm int})} \right] . \tag{11}$$

Since the future depends on our internal representation 
only because this internal representation is built from 
observations on the past, we can write 

$$P(X_{\rm future}|X_{\rm int}) = \int DX_{\rm past} \, P(X_{\rm future}|X_{\rm past}) P(X_{\rm past}|X_{\rm int}) \tag{12}$$
$$P(X_{\rm past}|X_{\rm int}) = P(X_{\rm int}|X_{\rm past}) \frac{P(X_{\rm past})}{P(X_{\rm int})} , \tag{13}$$



which shows that Eq. (10) really is a self-consistent 
equation for $P(X_{\rm int}|X_{\rm past})$. To solve these equations it is 
helpful to realize that they involve integrals over many 
variables, since $X_{\rm past}$ is $N$ dimensional and $X_{\rm future}$ is $M$ 
dimensional. In the limit that these numbers are very 
large and the temperature-like parameter $\lambda$ is very small, 
it is plausible that the relevant integrals are dominated 
by their saddle points. 

In the saddle point approximation one can find 
solutions for $P(X_{\rm int}|X_{\rm past})$ that have the following 
suggestive form. The variable $X_{\rm int}$ can be thought of as 
a point in a $K$-dimensional space, and then the 
distributions $P(X_{\rm int}|X_{\rm past})$ are Gaussian, centered on 
locations $\alpha_{\rm est}(X_{\rm past})$, which are the maximum a posteriori 
Bayesian estimates of the parameters $\alpha$ given the 
observations $\{x_1, x_2, \cdots, x_N\}$. The covariance of the Gaussian 
is proportional to the inverse Fisher information 
matrix, reflecting our certainty about $\alpha$ given the past data. 
Thus, in this case, if we solve the problem of efficiently 
representing predictive information, then we have solved 
the problem of learning the parameters of the 
probabilistic model that underlies the data we observe. 

Signal processing and learning usually are seen as very 
different problems, especially from a biological point of 
view. Building an optimal filter to separate signal from 
noise is a "low-level" task, presumably solved in the very 
first layers of sensory processing. Learning is a higher 
level problem, especially if the model we are learning 
starts to describe objects far removed from raw sense 
data, and presumably happens in the cerebral cortex. 
Returning to the problem of placing value on informa- 
tion, separating signal from noise by filtering and learn- 
ing the parameters of a probabilistic model seem to be 
very different goals. In the conventional biological view, 
organisms carry out these tasks at different times and 
with different mechanisms because the kinds of infor- 
mation that one extracts in the two cases have different 
value to the organism. What we have shown is that there 
is an alternative, and more unified, view. 

There is a single principle (efficient representation of 
predictive information) that values all (predictive) bits 
equally but in some instances corresponds to filtering 
and in others to learning. In this view, what determines 
whether we should filter or learn is not an arbitrary "bio- 
logical" choice of goal or assignment of value, but rather 
the structure of the data stream to which we have access. 



VII. NEURAL CODING OF PREDICTIVE 
INFORMATION 

It would be attractive to have a direct test of these 
ideas. We recall that neurons respond to sensory stimuli 
with sequences of identical action potentials or "spikes," 
and hence the brain's internal representation of the world 
is constructed from these spikes [5]. More narrowly, if 
we record from a single neuron, then this internal 
representation $X_{\rm int}$ can be identified with a short segment 
of the spike train from that neuron, while $X_{\rm past}$ and 
$X_{\rm future}$ are the past and future sensory inputs, 
respectively. The conventional analysis of neural responses 
focuses on the relationship between $X_{\rm past}$ and $X_{\rm int}$, 
trying to understand what features of the recent sensory 
stimuli are responsible for shaping the neural response. 
In contrast, the framework proposed here suggests that 
we try to quantify the information $I(X_{\rm int}; X_{\rm future})$ that 
neural responses provide about the future sensory 
inputs [25]. More specifically, to test the hypothesis that 
the brain generates maximally efficient representations 
of predictive information, we need to measure directly 
both $I(X_{\rm int}; X_{\rm future})$ and $I(X_{\rm int}; X_{\rm past})$, and see whether 
in a given sensory environment the neural representation 
$X_{\rm int}$ lies near the optimal curve predicted from Eq. (1). 

It would seem that to measure $I(X_{\rm int}; X_{\rm future})$ we 
would have to understand the structure of the code by 
which spike trains represent the future; the same 
problem arises even with $I(X_{\rm int}; X_{\rm past})$. In fact there is a 
more direct strategy [9]. The essential idea behind 
direct measurements of neural information transmission [5] 
is to use the (ir)reproducibility of the neural response to 
repeated presentations of the same dynamic sensory 
signal. If we think of the sensory stimulus as a movie that 
runs from time $t = 0$ to $t = T$, we can run the movie 
repeatedly in a continuous loop. Then at each moment 
$t$ we can look at the response $R \equiv X_{\rm int}$ of the neuron, 
and if there are enough repetitions of the movie we can 
estimate the conditional distribution $P(X_{\rm int}|t)$; the 
entropy $S_n(t)$ of this distribution measures the "noise" in 
the neural response. On the other hand, if we average 
over the time $t$ we can estimate $P(X_{\rm int})$, and the 
entropy $S_{\rm total}$ of this distribution measures the capacity of 
the neural responses to convey information. In the limit 
of large $T$ ergodicity allows us to identify averages over 
time with averages over the distribution out of which the 
stimulus movies are being drawn, and then 

$$I = S_{\rm total} - \frac{1}{T} \int_0^T dt \, S_n(t) \tag{14}$$

is the mutual information between sensory inputs and 
neural responses. Since the neuron responds causally to 
its sensory inputs, the information that it carries about 
these inputs necessarily is information about the past, 
$I(X_{\rm int}; X_{\rm past})$. Note that this computation does not 
require us to understand how to read out the encoded 
information, or even to know which features of the sensory 
inputs are encoded by the brain. 
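
The structure of Eq. (14) can be made concrete with a short Python sketch. This is a naive plug-in estimate under stated assumptions, not the procedure used in the experiments: responses are assumed to have been discretized already into integer "words" (the array `words`, the argument `n_symbols`, and both function names are hypothetical), the time integral is replaced by an average over time bins, and no correction is made for the sampling bias that matters in real data.

```python
import numpy as np


def entropy_bits(counts):
    """Plug-in entropy (bits) of the distribution implied by a vector of counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(-np.sum(p * np.log2(p)))


def direct_information(words, n_symbols):
    """Naive estimate of Eq. (14): total entropy minus time-averaged noise entropy.

    words[r, t] is the integer label of the response X_int observed at
    time bin t on repeat r of the same stimulus.
    """
    words = np.asarray(words)
    n_repeats, n_times = words.shape
    # S_total: entropy of responses pooled over all times and repeats.
    s_total = entropy_bits(np.bincount(words.ravel(), minlength=n_symbols))
    # S_n(t): entropy across repeats at fixed time, then averaged over time.
    s_noise = np.mean([
        entropy_bits(np.bincount(words[:, t], minlength=n_symbols))
        for t in range(n_times)
    ])
    return s_total - float(s_noise)


# A perfectly reproducible response that cycles through four words.
reliable = np.tile(np.arange(400) % 4, (50, 1))
print(direct_information(reliable, n_symbols=4))

# A response that ignores the stimulus entirely.
rng = np.random.default_rng(0)
unreliable = rng.integers(0, 2, size=(200, 500))
print(direct_information(unreliable, n_symbols=2))
```

The first example is fully locked to the stimulus and the second is independent of it, which brackets the range of behavior one expects from real spike trains.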

More careful analysis makes clear that the strategy 
in Ref. [9] measures the information which $X_{\rm int}$ provides 
about whatever aspects of the sensory stimulus are being 
repeated. For example, if we have a movie with sound 
and we repeat the video but randomize the audio, then 
following the analysis of Ref. [9] we would measure the 
information that neurons carry about their visual and 
not auditory inputs. Thus to measure $I(X_{\rm int}; X_{\rm future})$ 
we need to generate sensory stimuli that are all drawn 
independently from the same distribution but are con- 
strained to lead to the same future, and then repro- 
ducible neural responses to these stimuli will reflect in- 
formation about the future. This can be done by a vari- 
ety of methods. 

Consider a time dependent signal $s_k(t)$ generated on 
repeat $k$ as 

$$\tau_c \frac{ds_k(t)}{dt} + s_k(t) = \xi_k(t) , \tag{15}$$

where $\xi_k(t < 0) = \xi_0(t)$ for all $k$, while each $\xi_k(t > 0)$ 
is drawn independently; in the simplest case $\xi(t)$ has no 
correlations in time (white noise). Then the correlation 
function of the signal becomes 

$$\langle s_k(t) s_k(t') \rangle \propto \exp(-|t - t'|/\tau_c) , \tag{16}$$

and all $s_k(t < 0)$ are identical. Now take the 
trajectories and reverse the direction of time. The result is an 
ensemble of trajectories that lead to the same future but 
are otherwise statistically independent, as in the upper left 
panel in Fig. 4. 

FIG. 4: Trajectories with a common future and their neural representation. Top left: Sample trajectories $s_k(t)$ designed to 
converge on a common future at $t = 0$, as explained in the text. Units are angular velocity, since these signals will be used to 
drive the motion-sensitive visual neuron H1. The correlation time $\tau_c = 0.05$ s and the variance $\langle s^2 \rangle = (200^\circ/{\rm s})^2$. Bottom left: 
Responses of the blowfly H1 neuron to ninety different trajectories of angular velocity vs. time $s_k(t)$ that converge on a common 
future at $t = 0$; each dot represents a single spike generated by H1 in response to these individual signals. Stimulus delivery 
and recordings as described in Ref. [12]. Top right: Probability per unit time of observing a spike in response to trajectories 
that converge on three different common futures. At times long before the convergence, all responses are drawn from the same 
distribution and hence have the same spike probability within errors. Divergences among responses begin at a time $\sim \tau_c$ prior to 
the common future. This divergence means that the neural responses carry information about the particular common future, as 
explained in the text. Error bars are standard errors of the mean, estimated by bootstrapping. Bottom right: Information in 
single time bins about the identity of the future, normalized as an information rate. Blue points are from the real data, and 
green points are from shuffled data that should have zero information. 

We have used the strategy outlined above to explore
the coding of predictive information in the fly visual
system, returning to the neuron H1. The extreme stability
of recordings from H1 has been exploited in experiments
where we deliver motion stimuli by physically rotating
the fly outdoors in a natural environment rather than
showing movies to a fixed fly [12], and this is the path
that we follow here.

We have generated angular velocity trajectories $s(t)$
with a variance $\langle s^2 \rangle = (200°/s)^2$ and a correlation time
$\tau_c = 0.05$ s by numerical solution of Eq. (15). We choose
nine such segments at random to be the common futures,
and then follow the construction leading to the upper left
panel in Fig. 4 to generate ninety independent trajectories
for each of these common futures. These trajectories
are used as angular velocity signals to drive rotation of
a blowfly Calliphora vicina mounted on a motor drive as
in Ref [12] while we record the spikes generated by the
H1 neuron.
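
The construction of common-future trajectories can be sketched numerically. What follows is a minimal illustration in Python with NumPy, not the code used for the experiments. It assumes an exact one-step update for a process with exponentially decaying correlations, a time step `dt` of 2 ms, and shared and independent segments of 0.5 s each; these values and the function name `common_future_trajectories` are illustrative assumptions.

```python
import numpy as np

def common_future_trajectories(n_repeats=90, tau_c=0.05, sigma=200.0,
                               dt=0.002, t_shared=0.5, t_indep=0.5, seed=0):
    """Return (traj, t): traj has shape (n_repeats, n_time), in deg/s.

    All rows are identical for t >= 0 (the common future) and are
    generated with independent noise for t < 0.
    """
    rng = np.random.default_rng(seed)
    a = np.exp(-dt / tau_c)               # one-step correlation
    kick = sigma * np.sqrt(1.0 - a**2)    # keeps the variance at sigma**2
    n_shared = int(round(t_shared / dt))
    n_indep = int(round(t_indep / dt))

    # Forward in time: one shared segment driven by a single noise sequence.
    shared = np.empty(n_shared)
    shared[0] = sigma * rng.standard_normal()
    for i in range(1, n_shared):
        shared[i] = a * shared[i - 1] + kick * rng.standard_normal()

    # Then every repeat continues from the same point with its own noise.
    indep = np.empty((n_repeats, n_indep))
    s = np.full(n_repeats, shared[-1])
    for i in range(n_indep):
        s = a * s + kick * rng.standard_normal(n_repeats)
        indep[:, i] = s

    forward = np.hstack([np.tile(shared, (n_repeats, 1)), indep])

    # Reverse time: the shared past becomes a shared future.
    traj = forward[:, ::-1]
    t = (np.arange(traj.shape[1]) - n_indep) * dt   # t = 0 at convergence
    return traj, t
```

Calling the function with nine different seeds would give nine ensembles of ninety trajectories, one ensemble per common future. Because a stationary Gaussian process has the same statistics when run backward in time, the reversed trajectories should retain the variance and correlation time of the forward ones.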

The lower left panel in Fig. 4 shows examples of
the spike trains generated by H1 in response to independent
stimuli that converge on a common future. Long
before the convergence, stimuli are completely different
on every trial, and hence the neural responses are highly
variable. As we approach the convergence time, stimuli
on different trials start to share features which are
predictive of the common future, and hence the neural
responses become more reproducible. Importantly, stimulus
trajectories that converge on different common futures
generate responses that are not only reproducible
but also distinct from one another, as seen in the upper
right of Fig. 4. Our task now is to quantify this
distinguishability by estimating $I(X_{\rm int}; X_{\rm future})$.

FIG. 5: Information about the future carried by spikes in a
window that ends at the time of convergence onto a common
future. Essentially this is the integral of the information rate
shown at the lower right in Fig. 4, but error bars must be
evaluated carefully. Shown in green are the results for shuffled
data, which should be zero within errors if our estimates are
reliable.



Imagine dividing time into small bins of duration $\Delta t$.
For small $\Delta t$, we observe either one spike or no spikes, so
the neural response is a binary variable. If we sit at one 
moment in time relative to the convergence on a common 
future, we observe this binary variable, and it is associ- 
ated with one of the nine possible futures. Thus there is 
a 2 x 9 table of futures and responses, and it is straight- 
forward to use the experiments to fill in the frequencies 
with which each of these response/future combinations 
occurs. With 810 samples to fill in the frequencies of 18 
possible events, we have reasonably good sampling and 
can make reliable estimates of the mutual information, 
with error bars, following the methods of Refs [5J 12"?] . 
In each time bin, then, we can estimate the information
that the neural response provides about the future, and
we can normalize this by the duration of the bin $\Delta t$ to
obtain an information rate, as shown at the lower right
in Fig. 4. We see that information about sensory signals
in the future (t > 0) is negligible in the distant past, as
it must be, with the scale of the decay set by the
correlation time $\tau_c$. The local information rate builds up as
we approach t = 0, peaking for this stimulus ensemble
at ~30 bits/s.
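
The single-bin estimate can be illustrated with a short sketch. This is a naive "plug-in" estimate from the 2 × 9 table of counts, written in Python with NumPy; it is an assumption-laden illustration rather than the procedure of the cited references, which treat sampling errors more carefully. The function names and the array layout (futures × repeats, one boolean per trial) are hypothetical.

```python
import numpy as np

def mutual_information_bits(counts):
    """Plug-in mutual information (bits) from a response x future count table."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()                # joint distribution
    p_resp = p.sum(axis=1, keepdims=True)    # marginal over responses
    p_fut = p.sum(axis=0, keepdims=True)     # marginal over futures
    indep = p_resp @ p_fut                   # product of the marginals
    nz = p > 0                               # convention: 0 log 0 = 0
    return float(np.sum(p[nz] * np.log2(p[nz] / indep[nz])))

def info_rate_bits_per_s(spikes, dt):
    """spikes: boolean array (n_futures, n_repeats) for ONE time bin."""
    n_rep = spikes.shape[1]
    n_spike = spikes.sum(axis=1)                     # spike counts per future
    table = np.vstack([n_rep - n_spike, n_spike])    # 2 x n_futures table
    return mutual_information_bits(table) / dt

def shuffled_info_rate_bits_per_s(spikes, dt, rng):
    """Same estimate after randomly reassigning trials to futures."""
    shuffled = rng.permutation(spikes.ravel()).reshape(spikes.shape)
    return info_rate_bits_per_s(shuffled, dt)
```

The plug-in estimate is biased upward; to first order the bias is (R − 1)(F − 1)/(2N ln 2) bits for R responses, F futures and N samples, which is about 0.007 bits for a 2 × 9 table with 810 samples. Without a correction, the shuffled estimate from this sketch should therefore sit near that small value rather than exactly at zero.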

The results in Fig. 4 provide a moment by moment view
of the predictive information in the neural response, but
we would like a slightly more integrated view: if we sit at
t = 0 and look back across a window of duration T, how
much predictive information can we extract from the
responses in this window? The complication in answering
this question is that the neural response across this
window is a $T/\Delta t$-letter binary word, and the space of these
words is difficult to sample for large T. The problem
could be enormously simpler, however, if the information
carried by each spike were independent, since then
the total information would be the integral of the local
information rate. This independence isn't exactly true,
but it isn't a bad approximation under some conditions
[10], and we adopt it here. Then the only remaining
technical problem is to be sure that small systematic
errors in the estimate of the local information rate don't
accumulate as we compute the time integral, but this can
be checked by shuffling the data and making sure that
the shuffled data yields zero information. The results of
this computation are shown in Fig. 5.
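
Under the independence approximation, the windowed information is just a running sum of the local rate, taken backward from t = 0. A minimal sketch, assuming the rate is sampled in bins of width `dt` with the last bin ending at the convergence time (the function name is hypothetical):

```python
import numpy as np

def cumulative_information(rate_bits_per_s, dt):
    """Integrate a local information rate backward from t = 0.

    rate_bits_per_s: rates in time order; the last entry is the bin
    that ends at t = 0. Returns (T, info), where info[i] is the
    information (bits) in a window of duration T[i] ending at t = 0.
    """
    rate = np.asarray(rate_bits_per_s, dtype=float)
    info = np.cumsum(rate[::-1]) * dt        # sum from t = 0 backward
    T = dt * np.arange(1, rate.size + 1)     # window durations
    return T, info
```

Applying the same function to the rate estimated from shuffled data gives the control curve: if small systematic errors were accumulating, that curve would drift away from zero as T grows.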

Should we be surprised by the fact that the neural 
response from this one single neuron carries somewhat 
more than one bit of information about the future? Per- 
haps not. This is, after all, a direction selective motion 
sensitive neuron, and because of the correlations the di- 
rection of motion tends to persist; maybe all we have 
found is that the neuron encodes the current sign of the 
velocity, and this is a good predictor of the future sign, 
hence one bit, with perhaps a little more coming along 
with some knowledge of the speed. We'd like to suggest 
that things are more subtle. 

Under the conditions of these experiments, the
signal-to-noise in the fly's retina is quite high. As explained in
Ref [17], we can think of the fly's eye as providing an
essentially perfect view of the visual world that is updated
with a time resolution $\Delta t$ on the scale of milliseconds.
But if we are observing a Gaussian stochastic process
with a correlation time of $\tau_c$, then the limit to prediction
is set by the need to extrapolate across the 'gap' of
duration $\Delta t$. Since the exponentially decaying correlation
function corresponds to a Markov process, this limiting
predictive information is calculable simply as the mutual
information between two samples separated by the gap;
the result is

$$ I_{\rm pred} = \frac{1}{2} \log_2 \left[ \frac{1}{1 - \exp(-2\Delta t/\tau_c)} \right] . \qquad (17) $$



Plugging in the numbers, we find that, for these stimuli,
capturing roughly one bit of predictive information
depends in an essential way on the system having a time
resolution of better than ten milliseconds, and the observed
predictive information requires resolution in the
3–4 ms range.
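
These numbers can be checked directly. Two samples of a Gaussian process separated by $\Delta t$ have correlation coefficient $\rho = \exp(-\Delta t/\tau_c)$, and the mutual information between jointly Gaussian variables is $-\frac{1}{2}\log_2(1 - \rho^2)$. The sketch below evaluates this expression and its inverse for $\tau_c = 0.05$ s; the function names are hypothetical.

```python
import numpy as np

def limiting_predictive_information(dt, tau_c):
    """Bits of mutual information between two samples a time dt apart."""
    return 0.5 * np.log2(1.0 / (1.0 - np.exp(-2.0 * dt / tau_c)))

def resolution_for_bits(bits, tau_c):
    """Time resolution dt (s) at which the limit equals `bits`."""
    return -0.5 * tau_c * np.log(1.0 - 2.0 ** (-2.0 * bits))

tau_c = 0.05  # s, correlation time of the stimulus
for dt_ms in (10.0, 4.0, 3.0):
    bits = limiting_predictive_information(dt_ms * 1e-3, tau_c)
    print(f"dt = {dt_ms:4.1f} ms -> {bits:.2f} bits")
print(f"1 bit needs dt = {1e3 * resolution_for_bits(1.0, tau_c):.1f} ms")
```

With these parameters the limit is about 0.8 bits at 10 ms, one bit at roughly 7 ms, and between about 1.4 and 1.6 bits for resolutions of 4 to 3 ms.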

When we look back at a window of the neural response
with duration T, we expect to gain $R_{\rm info} T$ bits of
information about the stimulus [28], and as noted above this
necessarily is information about the past. Thus the x-axis
of Fig. 5, which measures the duration of the window,
can be rescaled, so that the whole figure is a plot of
information about the future vs information about the
past, as in the lower left quadrant of Fig. 3. Thus we can
compare this measure of neural performance with the
information bottleneck limit derived from the statistical
structure of the visual stimulus itself; since the stimulus
is a Gaussian stochastic process, this is straightforward
[23], and as above we assume that the fly has access to
a perfect representation of the velocity vs. time,
updated at multiples of the time resolution $\Delta t$. With $\Delta t$
in the range of 3–4 ms to be consistent with the total
amount of predictive information captured by the neural
response, we find that the performance of the neuron
always is within a factor of two of the bottleneck limit,
across the whole range of window sizes.

VIII. DISCUSSION 

We should begin our discussion by reminding the 
reader that, more than most papers, this is a report on 
work in progress, intended to capture the current state of 
our understanding rather than to draw firm conclusions. 

Efficient representation? Although one should be 
careful of glib summaries, it does seem that the fly's 
visual system offers concrete evidence of the brain build- 
ing representations of the sensory world that are efficient 
in the sense defined by information theory. The absolute 
information rates are large (especially in comparison to 
prior expectations in the field!), and there are many signs 
that the coding strategy used by the brain is matched 
quantitatively to the statistical structure of sensory in- 
puts, even as these change in time. This matching, which 
gives us a much broader view of "adaptation" in sensory 
processing, has now been observed directly in many
different systems [29].

Why are these bits different from all other bits? Con- 
trary to widespread views in the neuroscience commu- 
nity, information theory does give us a language for 
distinguishing relevant from irrelevant information. We 
have tried to argue that, for living organisms, the crucial 
distinction is predictive power. Certainly data without 
predictive power is useless, and thus 'purifying' predic- 
tive from non-predictive bits is an essential task. Our 
suggestion is that this purification may be more than 
just a first step, and that providing a maximally effi- 
cient representation of the predictive information can be 
mapped to more biologically grounded notions of
optimal performance. Whether or not this general argument
can be made rigorous, certainly it is true that extracting 
predictive information serves to unify the discussion of 
problems as diverse as signal processing and learning. 

A new look at the neural code? The traditional
approach to the analysis of neural coding tries to
correlate (sometimes in the literal mathematical sense) the
spikes generated by neurons with particular features of
the sensory stimulus. But, because the system is causal, 
these features must be features of the organism's recent 
past experience. Our discussion of predictive informa- 
tion suggests a very different view, in which we ask how 
the neural response represents the organism's future sen- 
sory experience. Although there are many things to be 
done in this direction, we find it exciting that one can 
make rather direct measurements of the predictive power 
encoded in the neural response. 

From an experimental point of view, the most
compelling success would be to map neural responses to
points in the information plane (information about the
future vs information about the past) and find that
these points are close to the theoretical optimum
determined by the statistics of the sensory inputs and the
information bottleneck. We are close to being able to do
this, but there is enough uncertainty in our estimates of
information (recall that we work only in the approximation
where spikes carry independent information) that
we are reluctant to put theory and experiment on the
same graph. Our preliminary result, however, is that
theory and experiment agree within a factor of two,
encouraging us to look more carefully.